For this project, we want to explore how we can classify genres of music. The challenge is to convert an audio file into a form of data that we know how to work with.
To do this, we will use spectrograms. A spectrogram is a graphical representation of the frequency content of a signal as it evolves over time: time is plotted on the x-axis, frequency on the y-axis, and color or intensity represents the magnitude of the signal's frequency components at each point. Spectrograms are widely used in signal processing and acoustics to analyze and visualize audio, providing insight into how the frequencies within a signal change over a given time frame. They are a standard tool for tasks such as speech analysis, music processing, and other scientific and engineering applications where understanding the temporal and spectral characteristics of a signal is essential.
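The code we actually used to generate our spectrograms appears in Appendix A; as an illustrative sketch only, a mel spectrogram image can be produced from an audio file with a library such as librosa (the file names below are placeholders):
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

y, sr = librosa.load("example.wav")                 # placeholder input file
S = librosa.feature.melspectrogram(y = y, sr = sr)  # frequency content over time
S_db = librosa.power_to_db(S, ref = np.max)         # convert power to decibels
librosa.display.specshow(S_db, sr = sr, x_axis = "time", y_axis = "mel")
plt.savefig("example_spec.png")                     # placeholder output image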
This report provides a walkthrough of our analysis, from loading the data to our accuracy results. As usual, we start by loading our libraries.
# Tensorflow Libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import image_dataset_from_directory
# Plotting and Model Evaluation Libraries
import matplotlib.pyplot as plt
import IPython.display as ipd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
# Utility Libraries
import numpy as np
import os
# Setting seed for reproducibility
keras.utils.set_random_seed(42)
verbose = 0
We now load in the spectrograms from our folder, and split them into training, validation, and test sets. For the code used to create the spectrograms from the audio files, see Appendix A.
training_images_filepath = "./Data/spec_original"
category_labels = os.listdir(training_images_filepath)
xdim = 180
ydim = 180
spectrograms = image_dataset_from_directory(
    training_images_filepath,
    image_size = (xdim, ydim),
    batch_size = 108)
## Use num_batches - 2 batches for training, 1 batch for validation, 1 batch for testing
num_batches = tf.data.experimental.cardinality(spectrograms).numpy()
train = spectrograms.take(num_batches - 2).cache()
remaining = spectrograms.skip(num_batches - 2)
validation = remaining.take(1).cache()
test = remaining.skip(1).cache()
Found 1080 files belonging to 11 classes.
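As a quick sanity check, with 1080 files and a batch size of 108 we expect ten batches in total, split 8/1/1 across the three sets:
# Confirm how many batches landed in each split
for name, ds in [("train", train), ("validation", validation), ("test", test)]:
    print(name, tf.data.experimental.cardinality(ds).numpy(), "batch(es)")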
Before continuing, we can get a sense of what some of our spectrograms look like. Below we output the first five spectrograms in the validation set along with their corresponding labels.
for images, labels in validation:
    plt.figure(figsize=(15, 5))
    for i in range(5):
        plt.subplot(1, 5, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(f"Label: {category_labels[labels[i].numpy()]}")
        plt.axis("off")
    plt.show()
We will now start building the model to classify the spectrograms. To do this, we will fine-tune the VGG16 model, as we did in Lab 10.
conv_base = keras.applications.vgg16.VGG16(
    weights = "imagenet",
    include_top = False,
    input_shape = (xdim, ydim, 3))
conv_base.summary()
Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input_1 (InputLayer)        [(None, 180, 180, 3)]     0
 block1_conv1 (Conv2D)       (None, 180, 180, 64)      1792
 block1_conv2 (Conv2D)       (None, 180, 180, 64)      36928
 block1_pool (MaxPooling2D)  (None, 90, 90, 64)        0
 block2_conv1 (Conv2D)       (None, 90, 90, 128)       73856
 block2_conv2 (Conv2D)       (None, 90, 90, 128)       147584
 block2_pool (MaxPooling2D)  (None, 45, 45, 128)       0
 block3_conv1 (Conv2D)       (None, 45, 45, 256)       295168
 block3_conv2 (Conv2D)       (None, 45, 45, 256)       590080
 block3_conv3 (Conv2D)       (None, 45, 45, 256)       590080
 block3_pool (MaxPooling2D)  (None, 22, 22, 256)       0
 block4_conv1 (Conv2D)       (None, 22, 22, 512)       1180160
 block4_conv2 (Conv2D)       (None, 22, 22, 512)       2359808
 block4_conv3 (Conv2D)       (None, 22, 22, 512)       2359808
 block4_pool (MaxPooling2D)  (None, 11, 11, 512)       0
 block5_conv1 (Conv2D)       (None, 11, 11, 512)       2359808
 block5_conv2 (Conv2D)       (None, 11, 11, 512)       2359808
 block5_conv3 (Conv2D)       (None, 11, 11, 512)       2359808
 block5_pool (MaxPooling2D)  (None, 5, 5, 512)         0
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
_________________________________________________________________
For the first step of our fine-tuning, we will freeze the base model and add our own head to train. We will train a few different heads, measure their results, and then move forward with only the best model.
conv_base.trainable = False
inputs = keras.Input(shape=(xdim, ydim, 3))
x = keras.applications.vgg16.preprocess_input(inputs)
x = conv_base(x)  # feed the preprocessed inputs into the frozen base
x = layers.Flatten()(x)
x = layers.Dense(256, activation = "relu")(x)
outputs = layers.Dense(len(category_labels), activation="softmax")(x)
model_simp = keras.Model(inputs, outputs)
model_simp.compile(loss="sparse_categorical_crossentropy",
                   optimizer="rmsprop",
                   metrics=["accuracy"])
history_simp = model_simp.fit(
    train,
    epochs = 50,
    validation_data = validation,
    verbose = verbose)
plt.plot(history_simp.history["accuracy"])
plt.plot(history_simp.history["val_accuracy"])
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend(["Training Set", "Validation Set"]);
The simple model by itself reaches about 90% validation accuracy.
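If we want the exact figure rather than reading it off the plot, we can pull it from the history object:
# Final-epoch validation accuracy from the training history
print(f"Final validation accuracy: {history_simp.history['val_accuracy'][-1]:.3f}")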
We will now add some more layers and see if we can improve performance.
conv_base.trainable = False
inputs = keras.Input(shape=(xdim, ydim, 3))
x = keras.applications.vgg16.preprocess_input(inputs)
x = conv_base(x)  # feed the preprocessed inputs into the frozen base
x = layers.Flatten()(x)
x = layers.Dense(256, activation = "relu")(x)
x = layers.Dense(256, activation = "relu")(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(len(category_labels), activation="softmax")(x)
model_med = keras.Model(inputs, outputs)
model_med.compile(loss="sparse_categorical_crossentropy",
                  optimizer="rmsprop",
                  metrics=["accuracy"])
history_med = model_med.fit(
    train,
    epochs = 50,
    validation_data = validation,
    verbose = verbose)
plt.plot(history_med.history["accuracy"])
plt.plot(history_med.history["val_accuracy"])
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend(["Training Set", "Validation Set"]);
Surprisingly, the model with an extra dense layer and a dropout layer sees no increase in performance. Between the simple model and the medium model, we should use the simple one, since it achieves the same performance with less complexity.
For our final model, we will test a deeper network with five dense layers and a dropout layer.
conv_base.trainable = False
inputs = keras.Input(shape=(xdim, ydim, 3))
x = keras.applications.vgg16.preprocess_input(inputs)
x = conv_base(x)  # feed the preprocessed inputs into the frozen base
x = layers.Flatten()(x)
x = layers.Dense(256, activation = "relu")(x)
x = layers.Dense(128, activation = "relu")(x)
x = layers.Dense(64, activation = "relu")(x)
x = layers.Dense(32, activation = "relu")(x)
x = layers.Dense(16, activation = "relu")(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(len(category_labels), activation="softmax")(x)
model_comp = keras.Model(inputs, outputs)
model_comp.compile(loss="sparse_categorical_crossentropy",
                   optimizer="rmsprop",
                   metrics=["accuracy"])
history_comp = model_comp.fit(
    train,
    epochs = 50,
    validation_data = validation,
    verbose = verbose)
plt.plot(history_comp.history["accuracy"])
plt.plot(history_comp.history["val_accuracy"])
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend(["Training Set", "Validation Set"]);
Despite being more complex, this model does worse than the first two. The extra capacity causes it to overfit the training data.
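To see the overfitting numerically, we can compare the final training and validation accuracies; a large gap is the telltale sign:
# Gap between training and validation accuracy at the last epoch
gap = history_comp.history["accuracy"][-1] - history_comp.history["val_accuracy"][-1]
print(f"Train/validation accuracy gap: {gap:.3f}")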
Moving forward, we will use the simple model as the head. Now that the head is moderately trained, we can unfreeze the last four layers of the VGG16 base and fine-tune them with a low learning rate, so the pretrained weights are only gently adjusted.
conv_base.trainable = True
for layer in conv_base.layers[:-4]:
    layer.trainable = False
model_simp.compile(loss="sparse_categorical_crossentropy",
                   optimizer=keras.optimizers.RMSprop(learning_rate=1e-5),
                   metrics=["accuracy"])
history = model_simp.fit(
    train,
    epochs = 10,
    validation_data = validation,
    verbose = verbose)
plt.plot(history.history["accuracy"])
plt.plot(history.history["val_accuracy"])
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend(["Training Set", "Validation Set"]);
Fine-tuning in this step does not seem to improve the model's accuracy further, but it is still a worthwhile step, since unfreezing part of the base often does help.
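A common safeguard when fine-tuning, which we did not use here, is to keep only the best weights seen during training with a ModelCheckpoint callback. A minimal sketch (the file name best_finetuned.keras is a placeholder):
checkpoint = keras.callbacks.ModelCheckpoint(
    "best_finetuned.keras",    # placeholder path for the saved model
    monitor = "val_accuracy",  # track validation accuracy each epoch
    save_best_only = True)     # only overwrite when the metric improves
history = model_simp.fit(
    train,
    epochs = 10,
    validation_data = validation,
    callbacks = [checkpoint],
    verbose = verbose)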
Now that we have our model, we can test its performance on the test set.
predictions_prob = model_simp.predict(test)
predictions = np.argmax(predictions_prob, axis = 1)
# Labels from unbatch() are scalar tensors, so stack (not concatenate) them
ground_truth = [label for _, label in test.unbatch()]
ground_truth = tf.stack(ground_truth, axis = 0).numpy()
accuracy = accuracy_score(ground_truth, predictions)
print("Accuracy of the model:", accuracy)
1/1 [==============================] - 3s 3s/step
Accuracy of the model: 0.9259259259259259
We get an accuracy of 92.6%, which is very strong. We can further analyze the performance by evaluating the confusion matrix.
fig, ax = plt.subplots(figsize=(12,8))
conf_matrix = confusion_matrix(ground_truth, predictions)
ConfusionMatrixDisplay(conf_matrix, display_labels = category_labels).plot(ax = ax);
From the confusion matrix we can see that most classes have at most one misclassified example. Most of these mistakes are plausible, such as metal being misclassified as rock or hiphop being misclassified as pop. One misclassification that stands out is a jpop audio track being misclassified as classical.
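We can back up that reading by counting the off-diagonal entries of the confusion matrix for each true class:
# Row sums minus the diagonal give the misclassification count per true class
misses = conf_matrix.sum(axis = 1) - np.diag(conf_matrix)
for label, count in zip(category_labels, misses):
    print(f"{label}: {count} misclassified")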
To get a better sense of our performance, we can look at the precision, recall, and F1-score of each class, treating each class one-vs-rest (precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of the two). We can write a function that prints these results and also identifies and displays the misclassified images.
def multiclass_accuracy_results(y_true, y_pred, test_images):
    labels = list(set(y_true))
    misclassified_indices = []
    for label in labels:
        # One-vs-rest binarization for the current class
        new_vec_true = y_true == label
        new_vec_pred = y_pred == label
        misclassified_indices.extend(np.where((new_vec_true != new_vec_pred) & new_vec_true)[0])
        print(f'Class {label}:')
        print(f'Accuracy: {accuracy_score(new_vec_true, new_vec_pred)}')
        print(f'Precision: {precision_score(new_vec_true, new_vec_pred)}')
        print(f'Recall: {recall_score(new_vec_true, new_vec_pred)}')
        print(f'F1-Score: {f1_score(new_vec_true, new_vec_pred)}')
        print()
    misclassified_indices = np.unique(misclassified_indices)
    # Display misclassified images side by side
    plt.figure(figsize=(15, 3))
    for i, idx in enumerate(misclassified_indices):
        plt.subplot(1, len(misclassified_indices), i + 1)
        plt.imshow(test_images[idx].numpy().astype("uint8"))
        plt.title(f'True: {y_true[idx]}, Predicted: {y_pred[idx]}')
        plt.axis('off')
    plt.show()
    return None
test_images = [image for image, label in test.unbatch()]
multiclass_accuracy_results(ground_truth, predictions, test_images)
Class 0:
Accuracy: 0.9814814814814815
Precision: 0.8571428571428571
Recall: 0.8571428571428571
F1-Score: 0.8571428571428571

Class 1:
Accuracy: 0.9907407407407407
Precision: 0.8333333333333334
Recall: 1.0
F1-Score: 0.9090909090909091

Class 2:
Accuracy: 0.9907407407407407
Precision: 0.9090909090909091
Recall: 1.0
F1-Score: 0.9523809523809523

Class 3:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-Score: 1.0

Class 4:
Accuracy: 0.9814814814814815
Precision: 0.9333333333333333
Recall: 0.9333333333333333
F1-Score: 0.9333333333333333

Class 5:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-Score: 1.0

Class 6:
Accuracy: 0.9907407407407407
Precision: 1.0
Recall: 0.9230769230769231
F1-Score: 0.9600000000000001

Class 7:
Accuracy: 0.9814814814814815
Precision: 0.9230769230769231
Recall: 0.9230769230769231
F1-Score: 0.9230769230769231

Class 8:
Accuracy: 0.9907407407407407
Precision: 0.9090909090909091
Recall: 1.0
F1-Score: 0.9523809523809523

Class 9:
Accuracy: 0.9814814814814815
Precision: 1.0
Recall: 0.8823529411764706
F1-Score: 0.9375

Class 10:
Accuracy: 0.9629629629629629
Precision: 0.8
Recall: 0.8
F1-Score: 0.8000000000000002
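As a cross-check on our hand-rolled function, sklearn's classification_report computes the same per-class precision, recall, and F1-score in a single call:
from sklearn.metrics import classification_report
print(classification_report(ground_truth, predictions, target_names = category_labels))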
The jpop image that was misclassified as classical is the one labeled True: 6, Predicted: 1. We can play the jpop audio clip to hear what it sounds like.
ipd.Audio('./Data/jpop00009.wav')